Welcome to this exploratory data analysis of the CRAN historical data set. As you may already know, CRAN is a network of servers around the world that store code and documentation for R packages over time. As of writing this EDA, CRAN had just over 18,000 packages available in its repository.
CRAN Website
Heads or Tails has done a great job of grabbing the historical data, cleaning it up, and preparing it for us R enthusiasts. Read about the approach he followed in his blog post.
Read through the initial setup in the 4 tabs below.
First, I import some useful libraries and set some plotting defaults.
# Data Manipulation
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(purrr)
library(stringr)
library(urltools)
library(magrittr)
# Plots
library(ggplot2)
library(naniar)
library(packcircles)
library(ggridges)
# Tables
library(reactable)
# Settings
theme_set(theme_minimal(
base_size = 14,
base_family = "Menlo"))
theme_update(
plot.title.position = "plot"
)
Let’s start by reading in the data. There are two CSV
files in this dataset. From his dataset page:

cran_package_overview.csv: all R packages currently available through CRAN, with (usually) 1 row per package…

cran_package_history.csv: version history of virtually all packages in the previous table...

hist_dt <- read_csv(
"../input/r-package-history-on-cran/cran_package_history.csv",
col_types = cols(
package = col_character(),
version = col_character(),
date = col_date(format = "%Y-%m-%d"),
repository = col_character()
)
)
ov_dt <- read_csv(
"../input/r-package-history-on-cran/cran_package_overview.csv",
col_types = cols(
package = col_character(),
version = col_character(),
depends = col_character(),
imports = col_character(),
license = col_character(),
needs_compilation = col_logical(),
author = col_character(),
bug_reports = col_character(),
url = col_character(),
date_published = col_date(format = "%Y-%m-%d"),
description = col_character(),
title = col_character()
)
)
glimpse(hist_dt, 80)
Rows: 119,464
Columns: 4
$ package <chr> "A3", "A3", "A3", "AATtools", "ABACUS", "abbreviate", "abby…
$ version <chr> "0.9.1", "0.9.2", "1.0.0", "0.0.1", "1.0.0", "0.1", "0.1", …
$ date <date> 2013-02-07, 2013-03-26, 2015-08-16, 2020-06-14, 2019-09-20…
$ repository <chr> "Archive", "Archive", "CRAN", "CRAN", "CRAN", "CRAN", "Arch…
glimpse(ov_dt, 80)
Rows: 18,388
Columns: 12
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", …
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", …
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file…
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues"…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "htt…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 201…
$ description <chr> "Supplies tools for tabulating and analyzing the res…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics f…
I love to take the first peek into a dataset with the amazing {skimr}
package. We can see that the right data types are set for all the
columns, and that dates have been imported correctly.
We can see in the history data that the first package
reported on CRAN dates back over 24 years, to 1998-02-25!
Furthermore, the overview tells us there’s a package
{pack} last published/updated on 2008-09-08.
While there’s no missing data in the history dataset,
there are a bunch of missing values in the overview
dataset. Let’s explore this a bit more.
skimr::skim(hist_dt)
── Data Summary ────────────────────────
Values
Name hist_dt
Number of rows 119464
Number of columns 4
_______________________
Column type frequency:
character 3
Date 1
________________________
Group variables None
── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 package 0 1 2 32 0 18372 0
2 version 0 1 3 15 0 10074 0
3 repository 0 1 4 7 0 2 0
── Variable type: Date ───────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max median n_unique
1 date 0 1 1998-02-25 2022-07-17 2018-01-22 7119
skimr::skim(ov_dt)
── Data Summary ────────────────────────
Values
Name ov_dt
Number of rows 18388
Number of columns 12
_______________________
Column type frequency:
character 10
Date 1
logical 1
________________________
Group variables None
── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 package 0 1 2 32 0 18371 0
2 version 0 1 3 14 0 2112 0
3 depends 4862 0.736 2 218 0 4289 0
4 imports 4026 0.781 2 573 0 12041 0
5 license 0 1 3 54 0 159 0
6 author 0 1 4 4096 0 15652 0
7 bug_reports 10760 0.415 11 81 0 7578 0
8 url 8426 0.542 4 466 0 9548 0
9 description 0 1 5 7372 0 18368 0
10 title 0 1 5 210 0 18306 0
── Variable type: Date ───────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max median n_unique
1 date_published 1 1.00 2008-09-08 2022-07-17 2021-03-22 2542
── Variable type: logical ────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean count
1 needs_compilation 0 1 0.239 FAL: 13991, TRU: 4397
My favorite way of exploring missing data is to make it visible,
using Nick Tierney’s
amazing {naniar}
package. There are a few columns with missing data. Let’s look at these
more closely.
depends and imports have roughly a quarter of the data as
<NA>. These are packages with no external dependencies. The
difference between the two can get a bit complex; best to learn about
it in Hadley’s chapter here.

bug_reports and url are missing in many rows. These don’t seem to be
data issues so much as authors who don’t have a place to file bugs, or
a website for their package, respectively.

date_published has only 1 row missing, which looks like a data quality
slip.

ov_dt %>%
dplyr::arrange(date_published) %>%
vis_miss()
Since this is an open-ended exploration - unlike an EDA aimed at building a predictive model - before I continue to the plotting, I’d like to posit some questions to guide the flow of further work. The first five questions are from Martin’s blog, with further questions which I think would be interesting to explore.
To aid answering many of these, I first need to create a few new
features in the overview data set.
Read about the feature development in the tabs below. We go from
12 columns to 29 columns in the overview data set.
Per the R package section in Hadley’s book,
“an R package version is a sequence of at least two integers
separated by either . or -. For example,
1.0 and 0.9.1-10 are valid versions, but
1 and 1.0-devel are not”. Typically,
packages do follow the three-number format of
<major>.<minor>.<patch>. I’m assuming
this is true, just to simplify things; I have a feeling it’ll
capture most of the cases.
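As a quick sanity check on that assumption, version strings can be tested against Hadley’s rule with a regular expression (a sketch; hist_dt as read in above):

```r
# Hadley’s rule: at least two integers separated by '.' or '-'
is_valid_version <- function(v) grepl("^[0-9]+([.-][0-9]+)+$", v)

is_valid_version(c("1.0", "0.9.1-10", "1", "1.0-devel"))
# TRUE TRUE FALSE FALSE

# Share of history rows following the strict <major>.<minor>.<patch> format
mean(grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", hist_dt$version))
```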
This feature could help answer questions about version number progressions.
Adding 3 columns here…
split_versions <- function(dat) {
stopifnot("version" %in% names(dat))
dat %>%
separate(
version,
into =
c("major", "minor", "patch"),
sep = "\\.",
extra = "merge", # for versions like 1.0.3-3000, keep the '3-3000' together in the 3rd col
fill = "right",
remove = FALSE
)
}
ov_dt <- ov_dt %>% split_versions()
hist_dt <- hist_dt %>% split_versions()
glimpse(ov_dt, 100)
Rows: 18,388
Columns: 15
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
For the last published version of each package, how many dependencies and/or imports does it have? My hypothesis is that packages in the past relied on fewer dependencies, since they were more likely than not written in base R. With the recent explosion in R adoption, and the adoption of the tidyverse framework, more recent packages would have a larger set of dependencies.
Adding 2 columns here…
ov_dt <- ov_dt %>%
mutate(
# Dependencies
num_dep = purrr::map_int(
.x = depends,
.f = function(x){
x %>%
stringr::str_split(",", simplify = TRUE) %>%
length()
}
),
num_dep = ifelse(is.na(depends), 0, num_dep),
# Imports
num_imports = purrr::map_int(
.x = imports,
.f = function(x){
x %>%
stringr::str_split(",", simplify = TRUE) %>%
length()
}
),
num_imports = ifelse(is.na(imports), 0, num_imports)
)
glimpse(ov_dt, 100)
Rows: 18,388
Columns: 17
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
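One note: the glimpses further down also show a num_authors column whose construction isn’t included in this excerpt. A plausible reconstruction (my assumption, not the original code) counts comma-separated entries in author:

```r
# Assumed reconstruction of num_authors: count comma-separated names.
# Crude, since `author` is free text and commas inside roles or
# institutions will inflate the count.
ov_dt <- ov_dt %>%
  mutate(num_authors = stringr::str_count(author, ",") + 1L)
```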
Temporal features are typically useful for aggregation downstream.
Adding 6 columns here…
hist_dt <- hist_dt %>%
mutate(
year = lubridate::year(date),
month = lubridate::month(date, label = TRUE),
day = lubridate::day(date),
wday = lubridate::wday(date, label = TRUE),
yr_mon = sprintf("%d-%s", year, month),
dt = lubridate::ym(paste0(year, "-", month))
)
ov_dt <- ov_dt %>%
filter(!is.na(date_published)) %>%
mutate(
year = lubridate::year(date_published),
month = lubridate::month(date_published, label = TRUE),
day = lubridate::day(date_published),
wday = lubridate::wday(date_published, label = TRUE),
yr_mon = sprintf("%d-%s", year, month),
dt = lubridate::ym(paste0(year, "-", month))
)
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 24
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
How long are the titles and description fields in the latest package submissions? Any interesting trends over time?
Adding 2 columns here…
ov_dt <- ov_dt %>%
mutate(
len_title = purrr::map_int(title, ~ stringr::str_count(.x, "\\w+")),
len_desc = purrr::map_int(description, ~ stringr::str_count(.x, "\\w+"))
)
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 26
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
The raw dataset has 159 unique levels for the license
variable.
ov_dt %>%
count(license) %>%
reactable(compact = TRUE)
But with many of them being quite similar to each other, some binning
is in order to extract patterns. Here, I use case_when()
to bin together similar licenses. (I’m no expert in these licenses; I’m
sure I’m taking some liberties in the grouping here.)
Adding 1 column here…
ov_dt <- ov_dt %>%
mutate(
license_cleaned = case_when(
str_detect(license, "^GPL-3") ~ "GPL-3",
str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*3") ~ "GPL-3",
str_detect(license, "^GPL-2") ~ "GPL-2",
str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*2") ~ "GPL-2",
str_detect(license, "^AGPL") ~ "AGPL",
str_detect(license, "^LGPL") ~ "LGPL",
str_detect(license, "Apache") ~ "Apache",
str_detect(license, "BSD") ~ "BSD",
str_detect(license, "MIT") ~ "MIT",
str_detect(license, "CC0") ~ "CC0",
license == "GPL" ~ "GPL",
TRUE ~ "Other"
# str_detect(license, "GNU") ~ "GNU", # Left these out after some trials with plots below
# str_detect(license, "MPL") ~ "MPL",
# str_detect(license, "Unlimited") ~ "Unlimited",
# str_detect(license, "^CC") ~ "CC",
)
)
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 27
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
$ license_cleaned <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
Which domains do package authors typically use? My guess is GitHub rules them all, but is that true? Can we see any rise of other offerings like GitLab or BitBucket?
Adding 2 columns here…
ov_dt <- ov_dt %>%
mutate(
url_domain = map_chr(url, ~ {
# map_chr() must return a character, hence NA_character_ rather than NA
if (is.na(.x)) NA_character_ else url_parse(.x)$domain
}),
bug_domain = map_chr(bug_reports, ~ {
if (is.na(.x)) NA_character_ else url_parse(.x)$domain
}))
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 29
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
$ license_cleaned <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
$ url_domain <chr> NA, NA, "shiny.abdn.ac.uk", "github.com", "github.com", NA, NA, NA, NA, …
$ bug_domain <chr> NA, "github.com", NA, NA, "github.com", NA, NA, NA, NA, NA, "github.com"…
Now that I have the data sets prepared and ready, it’s time for the fun part - being creative and creating some interesting visuals! Let’s attack those questions one at a time.
Q: How have the dependencies & imports changed over time?
Here’s a time series plot of the number of imports and dependencies since 2008. Since the values are integers, adding some jitter gives some much-needed separation between the individual values.
The data supports my hypothesis that more recent packages would have a larger set of dependencies. We’re currently at a median of 6 imports. But look at the explosion of package dependencies in the recent past!
Also, what happened in mid-2015? There’s a clear elbow in the trend right at that time. It doesn’t seem organic; did some popular package get released which others used as dependencies? Did CRAN’s measurement system change?
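To put a rough number on that elbow, one could compare median import and dependency counts for packages published before and after mid-2015 (a sketch using the features built above; the cutoff date is my choice):

```r
ov_dt %>%
  mutate(era = ifelse(date_published < as.Date("2015-07-01"),
                      "before mid-2015", "after mid-2015")) %>%
  group_by(era) %>%
  summarise(
    n_packages     = n(),
    median_imports = median(num_imports),
    median_depends = median(num_dep)
  )
```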
plot_dotplot_ts <- function(dat,
title,
dday = 90,
pcutoff = 0.95,
size = 1,
alpha = 0.1){
stopifnot(all(c("dt", "date_published", "y", "median_y") %in% names(dat)))
p95 <- quantile(dat$y, pcutoff)
dat %>%
ggplot(aes(x = dt)) +
geom_jitter(aes(y = y), alpha = alpha, size = size) +
geom_smooth(aes(y = median_y), color = "red") +
scale_x_date(
date_breaks = "1 year",
date_labels = "'%y",
minor_breaks = NULL,
expand = c(NA, 0.2)
) +
scale_y_continuous(minor_breaks = NULL, limits = c(NA, p95)) +
theme(
panel.grid = element_blank(),
axis.ticks.x.bottom = element_line(size = 0.8, colour = "gray"),
axis.ticks.y.left = element_line(size = 0.8, colour = "gray"),
plot.margin = margin(0, 50, 0, 50)
) +
labs(title = title,
caption = sprintf("Y axis clipped at %0.2f percentile", pcutoff),
x = NULL, y = NULL) +
coord_cartesian(clip = "off") +
annotate(
"text",
x = max(dat$date_published) + lubridate::ddays(dday),
y = max(dat$median_y),
label = sprintf("Median: %d", max(dat$median_y)),
vjust = 0.5,
hjust = 0,
color = "red"
)
}
ov_dt %>%
select(date_published, dt, y = num_imports) %>%
timetk::pad_by_time(date_published, .pad_value = 0) %>%
mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>%
group_by(dt) %>%
mutate(median_y = median(y)) %>%
plot_dotplot_ts(
title = "How have package `imports` changed over time?",
dday = 90
)
ov_dt %>%
select(date_published, dt, y = num_dep) %>%
timetk::pad_by_time(date_published, .pad_value = 0) %>%
mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>%
group_by(dt) %>%
mutate(median_y = median(y))%>%
plot_dotplot_ts(
title = "How have package `dependencies` changed over time?",
dday = 90
)
Ridge plots, using {ggridges}, are another interesting way to look at
distributions over time. It’s easy to see the large spread of
description lengths and how it’s been increasing over time.
deps <- ov_dt %>%
select(year,
`Description Length` = len_desc,
`Title Length` = len_title) %>%
arrange(-year) %>%
filter(!is.na(year), year > 2008) %>%
mutate(year = factor(year, levels = seq(2008, 2022)))
deps %>%
pivot_longer(-year) %>%
ggplot(aes(y = year, x = value, fill = name)) +
stat_density_ridges(
bandwidth = 4,
scale = .95,
quantile_lines = TRUE,
quantiles = 2,
alpha = 0.7,
rel_min_height = 0.01
) +
scale_x_continuous(limits = c(0, 200), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_ridges(center = TRUE) +
theme(legend.position = "top",
legend.title = element_blank()) +
labs(
title = "Distribution of Description & Title Lengths since 2009",
x = NULL,
y = NULL
)
Q: What license is most used? Has there been a change over time?
plot_bubbles <- function(dat,
.scale,
plot_radius,
bubble_radius,
alpha,
maxiter) {
.qty <- nrow(dat)
theta <- seq(0, 360, length.out = .qty + 1)
dat$x <- plot_radius * cos(theta * pi / 180)[-1]
dat$y <- plot_radius * sin(theta * pi / 180)[-1]
dat$n_scaled <- dat$n / .scale
xpack <- rep(dat$x, times = dat$n_scaled)
ypack <- rep(dat$y, times = dat$n_scaled)
coords <- tibble(
x = xpack + runif(length(xpack)),
y = ypack + runif(length(ypack)),
r = bubble_radius
)
packed_coords <-
circleRepelLayout(coords, sizetype = "r", maxiter = maxiter)
packed_coords$layout %>%
ggplot(aes(x, y)) +
geom_point(aes(size = radius), alpha = alpha) +
coord_equal() +
theme_minimal() +
theme(
legend.position = "none",
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank()
) +
geom_text(
aes(
x = x,
y = y,
label = label
),
data = dat,
hjust = "center",
vjust = "center"
)
}
ov_dt %>%
count(license_cleaned) %>%
top_n(6, n) %>%
mutate(label = sprintf("%s\n%d Pkgs", license_cleaned, n)) %>%
arrange(runif(1:n())) %>%
plot_bubbles(
.scale = 100,
plot_radius = 10,
bubble_radius = 0.56,
alpha = 0.2,
maxiter = 1000
)
ov_dt %>%
group_by(license_cleaned) %>%
count() %>%
ggplot(aes(x = forcats::fct_reorder(license_cleaned, n), y = n, fill = license_cleaned)) +
geom_col() +
coord_flip() +
theme_minimal() +
guides(fill = "none") +
labs(x = "", y = "")
ov_dt %>%
group_by(dt) %>%
count(license_cleaned) %>%
mutate(license_cleaned = forcats::fct_reorder(license_cleaned, n)) %>%
ggplot(aes(x= dt, y = n, color = license_cleaned)) +
# geom_line( alpha = 0.3) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.3, se = FALSE) +
theme_light()
Q: Do packages have URLs for bug reports?
ov_dt %>%
group_by(dt) %>%
count(url_missing = is.na(url)) %>%
ggplot(aes(x = dt, y = n, color = url_missing)) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.3, se = FALSE) +
theme_light()
ov_dt %>%
group_by(dt) %>%
count(bug_reports_missing = is.na(bug_reports)) %>%
ggplot(aes(x = dt, y = n, color = bug_reports_missing)) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.3, se = FALSE) +
theme_light()
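Since raw counts grow with overall submission volume, a monthly share of packages carrying a bug_reports URL may read more clearly (a sketch; assumes the {scales} package, a ggplot2 dependency, is available):

```r
ov_dt %>%
  group_by(dt) %>%
  summarise(share_with_bugs = mean(!is.na(bug_reports))) %>%
  ggplot(aes(dt, share_with_bugs)) +
  geom_line(alpha = 0.4) +
  geom_smooth(span = 0.3, se = FALSE) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Share of packages with a bug-report URL") +
  theme_light()
```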
Q: Which repositories do packages use? Github/Bitbucket etc. How do these vary over time?
ov_dt %>%
filter(bug_domain != "") %>%
mutate(bug_domain = forcats::fct_lump_min(bug_domain, 20)) %>%
group_by(dt) %>%
count(bug_domain) %>%
ggplot(aes(x= dt, y = n, color = bug_domain)) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.5, se = FALSE) +
theme_light()
Q: Are there any temporal patterns in when versions are submitted to CRAN?
ov_dt %>%
filter(!is.na(dt), dt < "2022-07-01") %>%
count(dt) %>%
arrange(dt) %>%
timetk::pad_by_time(dt, .by = "month", .pad_value = 0) %>%
ggplot(aes(dt, n)) +
geom_line()
ov_dt %>%
filter(!is.na(dt), dt < "2022-07-01") %>%
count(dt) %>%
arrange(dt) %>%
timetk::pad_by_time(dt, .by = "month", .pad_value = 0) -> xdat
timetk::plot_seasonal_diagnostics(xdat, dt, log(n), .interactive = FALSE)
timetk::plot_stl_diagnostics(xdat %>% filter(dt > "2018-01-01", dt < "2022-07-01"), dt, n, .interactive = FALSE, .feature_set = c("observed", "season", "trend", "remainder"))
frequency = 12 observations per 1 year
trend = 12 observations per 1 year
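Beyond monthly seasonality, the wday feature created earlier makes it easy to check for day-of-week patterns in submissions (a quick sketch over the full history):

```r
hist_dt %>%
  count(wday) %>%
  ggplot(aes(x = wday, y = n)) +
  geom_col() +
  labs(title = "CRAN submissions by day of week",
       x = NULL, y = "Submissions")
```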